Linear regression: Prediction with regression models

DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods

Abstraction ⇝ Data

CS 10.1 revisited: Dungeness crab growth

Fitted Model Equation: \(\text{Premolt}_i = \beta_0 + \beta_1 \times \text{Postmolt}_i + \varepsilon_i\), where \(\varepsilon_i \sim \text{Normal}(0, \sigma_\varepsilon)\)

CS 10.1 revisited: Premolt and residuals “after”

CS 10.2 revisited: 24 hour exams

Fitted Model Equation: \(\text{Score}_i = \beta_0 + \beta_1 \times \text{Submit}_i + \varepsilon_i\), where \(\varepsilon_i \sim \text{Normal}(0, \sigma_\varepsilon)\)

CS 10.2 revisited: Score and residuals “after”

Briefly

  • We had to assume that there was a linear relationship between the response and explanatory variables
  • Inference for β1 and β0 is consequently about how the average (mean) response changes as the explanatory variable changes
  • Like the one-way analysis of variance, we could decompose the “variability” of the response variable, but this time to see whether the best-fit line explains more of the “variability” than the residuals do

Inference for the mean response…?

Adjusting the goalposts with regression model

We have fitted the best-fit line describing the average response with an explanatory variable

coef(crabs.fit)
(Intercept)    Postmolt 
 -28.947433    1.099481 

\(\widehat{\text{Premolt}}_i = -28.95 + 1.10 \times \text{Postmolt}_i\)

What if we wanted to infer plausible values of average response and the response itself?

This is known as making predictions with your fitted model, and why the explanatory variable is often called a predictor
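
As a quick check of the fitted equation above, the sketch below plugs a few postmolt sizes into it by hand. The coefficient values are copied from the coef(crabs.fit) output; the names b0, b1, postmolt and premolt.hat are illustrative only.

```r
# Coefficient estimates copied from the coef(crabs.fit) output
b0 <- -28.947433
b1 <- 1.099481

# Postmolt sizes (mm) at which to evaluate the fitted line
postmolt <- c(125, 135, 145, 155)

# Estimated mean premolt size (mm) at each postmolt size
premolt.hat <- b0 + b1 * postmolt
round(premolt.hat, 4)
```

These are point estimates only; they say nothing yet about how uncertain each estimate is.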

Multiple \(R^2\)

Multiple \(R^2\) describes the proportion of the total variability of the response variable that is explained by the fitted model

anova(crabs.fit)
Analysis of Variance Table

Response: Premolt
           Df Sum Sq Mean Sq F value    Pr(>F)    
Postmolt    1  41393   41393   10389 < 2.2e-16 ***
Residuals 359   1430       4                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(crabs.fit)

Call:
lm(formula = Premolt ~ Postmolt, data = crabs.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6485 -1.3060  0.0829  1.2683 11.0291 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -28.94743    1.55480  -18.62   <2e-16 ***
Postmolt      1.09948    0.01079  101.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.996 on 359 degrees of freedom
Multiple R-squared:  0.9666,    Adjusted R-squared:  0.9665 
F-statistic: 1.039e+04 on 1 and 359 DF,  p-value: < 2.2e-16

Interpretation of Multiple \(R^2\)

Our fitted (simple linear regression) model explains 96.7% of the variability in the premolt sizes of female Dungeness crabs
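
As a check, Multiple \(R^2\) can be recovered from the (rounded) sums of squares in the analysis of variance table, writing SS for a "Sum Sq" entry (notation assumed here):

\[
R^2 = \frac{\text{SS}_{\text{Postmolt}}}{\text{SS}_{\text{Postmolt}} + \text{SS}_{\text{Residuals}}} = \frac{41393}{41393 + 1430} = \frac{41393}{42823} \approx 0.9666
\]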

Is there an “optimal” value for Multiple \(R^2\)?

Confidence interval for the mean response

It is an interval that can capture the “true” mean response for all elements of the “population” for a specific explanatory variable value (based on the fitted linear regression model)

Two sources of uncertainty: \(\beta_0\) and \(\beta_1\)

For female Dungeness crabs whose postmolt size is 145 mm, we estimate with 95% confidence that their average premolt size is somewhere between 130.3 and 130.7 mm

predict(crabs.fit, level = 0.95,
        newdata = data.frame(Postmolt = c(125, 135, 145, 155)),
        interval = "confidence")
       fit      lwr      upr
1 108.4877 108.0384 108.9370
2 119.4825 119.2039 119.7610
3 130.4773 130.2691 130.6854
4 141.4721 141.1574 141.7868
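
To see where the lwr and upr columns come from, here is a minimal sketch that rebuilds a confidence interval for the mean response by hand as fit ± t × (standard error of the fit). The course data are not reproduced here, so sim.df is simulated stand-in data; sim.df, sim.fit and the other object names are hypothetical.

```r
# Simulated stand-in data (hypothetical; NOT the Dungeness crab data)
set.seed(1)
sim.df <- data.frame(Postmolt = runif(200, min = 110, max = 170))
sim.df$Premolt <- -29 + 1.1 * sim.df$Postmolt + rnorm(200, sd = 2)
sim.fit <- lm(Premolt ~ Postmolt, data = sim.df)

new.df <- data.frame(Postmolt = c(125, 135, 145, 155))

# Point estimates and their standard errors
p <- predict(sim.fit, newdata = new.df, se.fit = TRUE)

# t multiplier for a 95% interval on the residual degrees of freedom
t.crit <- qt(0.975, df = p$df)

by.hand <- cbind(fit = p$fit,
                 lwr = p$fit - t.crit * p$se.fit,
                 upr = p$fit + t.crit * p$se.fit)

# The same interval from predict() directly
built.in <- predict(sim.fit, newdata = new.df,
                    interval = "confidence", level = 0.95)

all.equal(by.hand, built.in, check.attributes = FALSE)
```

The standard error of the fit reflects only the uncertainty in the estimated intercept and slope, which is why this interval is about the mean response and not an individual observation.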

Comparing \(\hat{y}\) at two different \(x\) values?

We can leverage the emmeans package to do the mathematics for us…

library(emmeans)
emmeans(crabs.fit, ~ Postmolt, at = list(Postmolt = c(125, 135, 145))) |>
  pairs(infer = c(TRUE, FALSE)) # Just to "chop-off" p-values for this example
 contrast                  estimate    SE  df lower.CL upper.CL
 Postmolt125 - Postmolt135      -11 0.108 359    -11.2    -10.7
 Postmolt125 - Postmolt145      -22 0.216 359    -22.5    -21.5
 Postmolt135 - Postmolt145      -11 0.108 359    -11.2    -10.7

Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 3 estimates 

We are 95% confident that the mean premolt size of female Dungeness crabs whose postmolt size is 125 mm is somewhere between 10.7 and 11.2 mm lower than that of crabs whose postmolt size is 135 mm
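
The estimate and SE columns can also be checked by hand. In a simple linear regression the intercept cancels when two estimated means are subtracted, so the difference is just the slope times the difference in x. The sketch below uses the slope and its standard error copied from the summary(crabs.fit) output; the object names are illustrative only.

```r
# Slope estimate and its standard error, copied from summary(crabs.fit)
b1 <- 1.09948
se.b1 <- 0.01079

# Difference in postmolt size for the contrast Postmolt125 - Postmolt135
delta.x <- 125 - 135

# Difference in estimated mean premolt size, and its standard error
estimate <- b1 * delta.x
se <- abs(delta.x) * se.b1

round(c(estimate = estimate, SE = se), 3)
```

The interval reported by emmeans is a little wider than estimate ± t × SE would give, because it applies the Tukey adjustment for a family of three comparisons.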

Prediction interval for an individual response

It is an interval that can capture a plausible response value for a specific observation in the “population” for a specific explanatory variable value (based on the fitted linear regression model)

Three sources of uncertainty: \(\beta_0\), \(\beta_1\), and \(\sigma_\varepsilon\)

We predict with 95% confidence that a female Dungeness crab’s premolt size is somewhere between 126.5 and 134.4 mm if its postmolt size is 145 mm

predict(crabs.fit, level = 0.95,
        newdata = data.frame(Postmolt = c(125, 135, 145, 155)),
        interval = "prediction")
       fit      lwr      upr
1 108.4877 104.5366 112.4387
2 119.4825 115.5472 123.4178
3 130.4773 126.5463 134.4082
4 141.4721 137.5341 145.4101
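
The extra width of the prediction interval comes from adding the residual variability to the uncertainty of the fitted line. As a rough sketch using only numbers printed in the outputs (so small rounding differences are expected), the prediction interval at a postmolt size of 145 mm can be approximately rebuilt from the confidence interval at 145 mm and the residual standard error. The object names are illustrative only.

```r
# Values copied from the outputs for Postmolt = 145
fit <- 130.4773        # point estimate
ci.lwr <- 130.2691     # lower limit, 95% CI for the mean response
ci.upr <- 130.6854     # upper limit, 95% CI for the mean response
sigma.hat <- 1.996     # residual standard error
df.resid <- 359        # residual degrees of freedom

t.crit <- qt(0.975, df = df.resid)

# Back out the standard error of the fit from the CI half-width
se.fit <- (ci.upr - ci.lwr) / (2 * t.crit)

# Standard error for predicting an individual response
se.pred <- sqrt(se.fit^2 + sigma.hat^2)

pred.int <- fit + c(-1, 1) * t.crit * se.pred
round(pred.int, 2)
```

Because sigma.hat is much larger than se.fit here, the residual variability dominates the width of the prediction interval.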

Cautions with predictions

  • Check that the assumptions for inference with a regression model have been met
  • Do not extrapolate. That is, do not predict outside the range of the explanatory variable
  • Whether a prediction interval is “narrow” should be judged relative to the spread of the response variable
  • Apply common sense
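
One way to guard against extrapolation is to check new explanatory values against the observed range before calling predict(). The sketch below is an illustration only: sim.df is simulated stand-in data and within.range() is a hypothetical helper, not a function from R or any package.

```r
# Simulated stand-in data (hypothetical; NOT the Dungeness crab data)
set.seed(121)
sim.df <- data.frame(Postmolt = runif(100, min = 115, max = 165))
sim.df$Premolt <- -29 + 1.1 * sim.df$Postmolt + rnorm(100, sd = 2)

# Hypothetical helper: TRUE where a new value lies inside the observed range
within.range <- function(new.x, observed.x) {
  new.x >= min(observed.x) & new.x <= max(observed.x)
}

new.x <- c(100, 125, 145, 180)
ok <- within.range(new.x, sim.df$Postmolt)

# Only the values flagged TRUE are safe to predict at
data.frame(Postmolt = new.x, within.range = ok)
```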

CS 10.1 revisited: Uncertainty of its best-fit line

The red-dashed lines are the 95% CI for the mean response and blue-dashed lines are the 95% PI for an individual response

CS 10.2 revisited: Uncertainty of its best-fit line

The red-dashed lines are the 95% CI for the mean response and blue-dashed lines are the 95% PI for an individual response

R code for Slides 16 & 17

For crabs.df:

# Scatter plot of the data, with semi-transparent points
plot(Premolt ~ Postmolt, data = crabs.df, main = "Premolt vs postmolt sizes of Dungeness crabs",
     xlab = "Postmolt size (mm)", ylab = "Premolt size (mm)", col = rgb(0, 0, 0, alpha = 0.25))
crabs.fit <- lm(Premolt ~ Postmolt, data = crabs.df)
# Grid of postmolt sizes at which to evaluate the intervals
x <- seq(100, 180, length.out = 100)
# Each result is a matrix with columns fit, lwr, upr
y <- predict(crabs.fit, newdata = data.frame(Postmolt = x), interval = "prediction")    # 95% PI
yhat <- predict(crabs.fit, newdata = data.frame(Postmolt = x), interval = "confidence") # 95% CI
# Best-fit line
abline(crabs.fit, col = "red")
# Lower and upper limits of the 95% PI (blue) and 95% CI (red)
lines(x = x, y = y[, 2], col = "blue", lty = 5)
lines(x = x, y = y[, 3], col = "blue", lty = 5)
lines(x = x, y = yhat[, 2], col = "red", lty = 5)
lines(x = x, y = yhat[, 3], col = "red", lty = 5)

For submit.df:

# Scatter plot of the data, with semi-transparent points
plot(Score ~ Submit, data = submit.df, main = "Exam marks vs final submission times of students", xlim = c(0, 24),
     xlab = "Final submission time (hours)", ylab = "Exam mark (out of 80)", col = rgb(0, 0, 0, alpha = 0.25), xaxs = "i")
submit.fit <- lm(Score ~ Submit, data = submit.df)
# Grid of submission times at which to evaluate the intervals
x <- seq(0, 24, length.out = 100)
# Each result is a matrix with columns fit, lwr, upr
y <- predict(submit.fit, newdata = data.frame(Submit = x), interval = "prediction")    # 95% PI
yhat <- predict(submit.fit, newdata = data.frame(Submit = x), interval = "confidence") # 95% CI
# Best-fit line
abline(submit.fit, col = "red")
# Lower and upper limits of the 95% PI (blue) and 95% CI (red)
lines(x = x, y = y[, 2], col = "blue", lty = 5)
lines(x = x, y = y[, 3], col = "blue", lty = 5)
lines(x = x, y = yhat[, 2], col = "red", lty = 5)
lines(x = x, y = yhat[, 3], col = "red", lty = 5)